1 Importing the relevant libraries and dataset 🛠️

First, we import the libraries required for the current analysis.

library(tidyverse)
library(naniar)
library(bookdown)
library(stringr)
library(stringi)
library(lubridate)
library(DT)
library(forcats)
library(ggthemes)
library(corrplot)
library(mltools)
library(data.table)
library(visdat)
library(janitor)
library(cowplot)
library(caTools)
library(pscl)
library(ROCR)
library(caret)
library(xgboost)
library(randomForest)
library(lightgbm)
library(Matrix)
library(catboost)
library(magrittr)
library(fmsb)
library(plotly)
library(TTR)
library(broom)

2 Introduction

What are we trying to study?

Time series forecasting is a powerful analytical technique used to predict future values based on historical data. It plays a crucial role in various domains such as finance, economics, weather forecasting, supply chain management, and more. By analyzing patterns, trends, and dependencies within a time series dataset, forecasting models aim to provide accurate predictions and insights into future behavior.

The first step in time series forecasting is to understand the characteristics of the data. Time series data consists of a sequence of observations collected over time, where each observation is associated with a specific timestamp. These observations may exhibit trends, seasonality, cyclic patterns, or irregularities, which need to be identified and accounted for in the forecasting process.

- The internet

Great! We have all the libraries loaded. Next, we will load the dataset required for the sales forecasting analysis.

We will use one dataset for exploratory data analysis and for training the prediction model, while the test dataset will be used to evaluate the model on completely new data.

After reading the data, let us see what the train dataset looks like.

df_train <- read_csv("data/train.csv")
df_test <-  read_csv("data/test.csv")
head(df_train)
## # A tibble: 6 × 6
##      id date       country   store        product                       num_sold
##   <dbl> <date>     <chr>     <chr>        <chr>                            <dbl>
## 1     0 2017-01-01 Argentina Kaggle Learn Using LLMs to Improve Your C…       63
## 2     1 2017-01-01 Argentina Kaggle Learn Using LLMs to Train More LLMs       66
## 3     2 2017-01-01 Argentina Kaggle Learn Using LLMs to Win Friends an…        9
## 4     3 2017-01-01 Argentina Kaggle Learn Using LLMs to Win More Kaggl…       59
## 5     4 2017-01-01 Argentina Kaggle Learn Using LLMs to Write Better          49
## 6     5 2017-01-01 Argentina Kaggle Store Using LLMs to Improve Your C…       88

We observe that the dataset is fairly simple with the following features.

  1. Date: The date of purchase associated with a particular product.
  2. Country: The country where the purchase was made.
  3. Store: The store associated with the purchased product.
  4. Product: The type of product purchased.
  5. num_sold: The number of products sold.
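The feature types can be double-checked programmatically. Below is a quick sketch using a toy data frame built from the first rows shown above (the real check would run on df_train itself):

```r
# Toy frame mirroring the head() output above
toy <- data.frame(
  date     = as.Date(c("2017-01-01", "2017-01-01")),
  country  = c("Argentina", "Argentina"),
  store    = c("Kaggle Learn", "Kaggle Store"),
  product  = c("Using LLMs to Write Better", "Using LLMs to Train More LLMs"),
  num_sold = c(49, 66)
)

# Class of each column: one Date, three character columns, one numeric
sapply(toy, class)
```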

3 Data cleaning

3.1 Removal of unnecessary variables

The dataset appears to be fairly simple and concise. We will retain all the available features in this dataset except for the “id” column.

df_train <- df_train %>% select(-id)

3.2 Check for null values

In this step, we will check for the presence of null values in the dataset.

gg_miss_var(df_train)

Figure 3.1: Missingness in the dataset

Based on figure 3.1, we can observe that

✅ The dataset does not contain any missing values. This indicates that we have a clean dataset which is ready for EDA and further analysis.
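The same conclusion can be reached numerically with colSums(is.na(...)); a minimal sketch on a toy frame (df_toy is hypothetical and stands in for df_train, where every count would be zero):

```r
# Count missing values per column
df_toy <- data.frame(
  country  = c("Argentina", "Canada", NA),
  num_sold = c(63, NA, 9)
)
na_counts <- colSums(is.na(df_toy))
na_counts  # one missing value per column in this toy frame
```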

4 Exploratory Data Analysis

In this section, we will visualise the various features and obtain key insights through these visualisations.

4.1 Sales in each country

Let us try to observe the number of product sales for each country.

df_sales_count <- df_train %>% group_by(country) %>% summarise(count = n())
pl1 <- ggplot(data = df_sales_count, aes(x = country, y = count, fill = country)) +
  geom_col(color = 'black') +
  geom_label(aes(label = count)) +
  theme_classic() +
  labs(x = "Country", y = "Number of products sold") +
  ggtitle("Country wise distribution of sales") +
  theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))
pl1

Figure 4.1: Country wise distribution of sales

Based on figure 4.1, we can observe that,

💡 the dataset contains an equal number of sales records for each country. This is ideal for creating our prediction model, as the model can be trained without any bias originating from heterogeneous data. 💡

4.2 Global sales

df_date_sale <- df_train %>% group_by(date) %>% summarise(tot_sold =  sum(num_sold))


pl2 <- ggplot(data = df_date_sale,aes(x = date,y = tot_sold),group = date) + geom_line(color = 'blue') + theme_classic() + ggtitle("Total sales globally") + labs(y = "Total sales",x = "Date of purchase") + theme(plot.title = element_text(hjust = 0.5)) +
    annotate("segment",x = ymd(20200101),
    y = 5500,xend = ymd(20200401) ,
    yend = 8000 ,arrow = arrow(type = "closed", 
                              length = unit(0.02, "npc"))
  ) +
  annotate("text",x = ymd(20200101),
    y = 5000,colour = "red",
    label = 'Dip in total sales',
    size = unit(3, "pt")) 

pl2

Figure 4.2: Total courses sold

4.3 Trend line of total global sales

While we have observed the total global sales in section 4.2, let us observe the overall trend line using a simple moving average function.

df_date_sale_sma <- SMA(df_date_sale$tot_sold, n = 7)

plot.ts(df_date_sale_sma)
title("Trend line of global sales \n with 1 week moving average")
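To make the moving-average idea concrete, here is what a 3-point simple moving average does in base R; TTR's SMA(n = 7) applies the same computation over a trailing 7-day window (x here is a short hypothetical series):

```r
x <- c(10, 12, 11, 14, 13, 16)
# Trailing rolling mean over a window of 3; the first 2 entries are NA
# because a full window is not yet available
sma3 <- as.numeric(stats::filter(x, rep(1/3, 3), sides = 1))
sma3
```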

Based on figure 4.2, we can observe that

💡 there is a strong seasonality in the data, with sales peaking around the new year each time. However, an unexpected drop in sales was observed in 2020, which may have been caused by COVID-19 restrictions. 💡

4.4 Sales in each country

df_date_sale_country <- df_train %>% group_by(date,country) %>% summarise(tot_sold =  sum(num_sold))


pl4 <-ggplot(data = df_date_sale_country,
         aes(x = date, y = tot_sold, color = country),
         group = date) + geom_line() + theme_classic() + ggtitle("Total sales in all countries") + labs(y = "Total sales", x = "Date of purchase", color =
                                                                                                          "Country") + theme(plot.title = element_text(hjust = 0.5))

pl4

Figure 4.3: Total courses sold in each country

Based on figure 4.3, we can observe that

💡 the strong seasonality is observed equally in each of the 5 countries. The peaks and troughs appear around the same time of the year for all the countries. Sales were highest for Canada, followed by Japan, Spain, Estonia and Argentina, with sales in Argentina being particularly underwhelming. 💡

4.5 Product wise sales

Let us observe the product wise sales in the following visualisation.

df_prod_sale <- df_train %>% group_by(date,product) %>% summarise(tot_sold =  sum(num_sold))

pl5 <-ggplot(data = df_prod_sale,
         aes(x = date, y = tot_sold, color = product),
         group = date) + geom_line(alpha = 0.7) + theme_classic() + ggtitle("Total sales of products in all countries") + labs(y = "Total sales", x = "Date of purchase", color =
                                                                                                          "Product") + theme(legend.position = 'none')

ggplotly(pl5)

Figure 4.4: Product wise sales globally

Based on figure 4.4, we can observe that

💡 there is a sinusoidal seasonality in the sales of most Kaggle products. However, the product “Using LLMs to Win Friends and Influence People” does not show much seasonality and has much lower sales than the rest of the products. 💡

4.6 Store wise sales

After observing the sales in terms of products and location, let us check how the sales fare for each Kaggle store.

df_store_sale <- df_train %>% group_by(date,store) %>% summarise(tot_sold =  sum(num_sold))

pl6 <-ggplot(data = df_store_sale,
         aes(x = date, y = tot_sold, color = store),
         group = date) + geom_line(alpha = 0.7) + theme_classic() + ggtitle("Total sales of each store") + labs(y = "Total sales", x = "Date of purchase", color =
                                                                                                          "Store") + theme(plot.title = element_text(hjust = 0.5))

ggplotly(pl6)

Figure 4.5: Total sales of each store

Upon analysing figure 4.5, we can observe that

💡 the seasonal peaks in the sales for each of the Kaggle stores are in close synchronisation with each other. However, there is a distinct difference in the volume of sales for each store: the sales for “Kagglazon” are significantly higher than those of the “Kaggle Store” and “Kaggle Learn” stores. 💡

5 Data Wrangling

5.1 Feature Engineering

After analysing the data through our visualisations in the previous sections, we can start preparing the dataset for the ML algorithms. This requires transforming the data into a tidy format.

This involves converting categorical data such as Country, Store and Products into encoded data.

df_train$country <- factor(df_train$country)
df_train$store <- factor(df_train$store)
df_train$product <- factor(df_train$product)
dt_train <- data.table(df_train)
dt_train <- one_hot(dt_train,cols = c("country","store","product"))

df_train <- as.data.frame(dt_train)
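As a sanity check on what one_hot() produces, the same indicator encoding can be sketched with base R's model.matrix (toy data using the store names from the dataset):

```r
toy <- data.frame(store = factor(c("Kaggle Learn", "Kaggle Store", "Kagglazon")))
# "- 1" drops the intercept so every factor level gets its own 0/1 column
enc <- model.matrix(~ store - 1, data = toy)
enc
```

Each row contains exactly one 1, marking the store for that observation.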

✅ All right! We have finally prepared our dataset. In the next step, we will split the data into training and testing sets for our prediction model.

5.2 Train and test dataset preparation

The datasets for training and testing will now be prepared.

set.seed(101)
sample <- sample.split(df_train$num_sold, SplitRatio = 0.7)
train <- subset(df_train, sample == TRUE)
test <- subset(df_train, sample == FALSE)
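If caTools is unavailable, an equivalent 70/30 split can be sketched in base R with sample(); note that sample.split additionally balances the split on the outcome variable, which plain sampling does not (mtcars is used purely as a stand-in dataset here):

```r
set.seed(101)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_toy <- mtcars[train_idx, ]   # ~70% of rows
test_toy  <- mtcars[-train_idx, ]  # remaining ~30%
```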

6 Predictive modeling

6.1 Linear Regression

Let us utilise the linear regression technique to predict the number of products sold.

model_lr <- lm(num_sold~.,data=train)
glance(model_lr)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df   logLik      AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>    <dbl>  <dbl>
## 1     0.750         0.750  92.1    26107.       0    11 -569619. 1139264. 1.14e6
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

As we can observe,

💡 The linear regression model fared moderately while predicting the total number of sales with the model having an R-squared score of 75%. 💡

model_aug <- augment(model_lr)
fitted.results <- predict(model_lr,newdata=subset(test,select=-(num_sold)))

After fitting the linear regression model on the train dataset and predicting on the test dataset, let us see how the fitted and actual values compare.

df_lr <- as.data.frame(test$num_sold)
df_lr <- df_lr %>% rename("Actual_values" = "test$num_sold")
df_lr$fitted <- fitted.results
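With the actual and fitted values side by side, the prediction error can also be summarised numerically via RMSE and MAE; a minimal sketch using short hypothetical vectors in place of df_lr's columns:

```r
actual <- c(63, 66, 9, 59, 49)   # hypothetical observed sales
fitted <- c(60, 70, 15, 55, 50)  # hypothetical model predictions
rmse <- sqrt(mean((actual - fitted)^2))  # penalises large errors more
mae  <- mean(abs(actual - fitted))       # average absolute error
c(RMSE = rmse, MAE = mae)
```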

pl7 <- ggplot(data = df_lr, aes(x = Actual_values, y = fitted)) +
  geom_point() +
  geom_smooth(method = 'lm', aes(color = "Linear regression prediction")) +
  theme_classic() +
  labs(x = "Actual values", y = "Predicted values", color = "Model") +
  ggtitle("Predicted and actual values \n in Linear Regression model") +
  theme(plot.title = element_text(hjust = 0.5))
pl7

Figure 6.1: Predicted and actual values in Linear Regression model

Based on figure 6.1,

💡 we can observe that the linear regression model does not do a great job of predicting the number of sales. One reason could be that linear regression is sensitive to outliers. Another is that not every phenomenon can be accurately described by a linear model; the current problem may simply be poorly captured by one. 💡